Variable Selection Bias in Classification Trees Based on Imprecise Probabilities

Author

  • Carolin Strobl
Abstract

Classification trees are a popular statistical tool with multiple applications. Recent advancements of traditional classification trees, such as the approach of classification trees based on imprecise probabilities by Abellán and Moral (2005), effectively address their tendency to overfit. However, another flaw inherent in traditional classification trees is not eliminated by the imprecise probability approach: due to a systematic finite-sample bias in the estimator of the entropy criterion employed in variable selection, categorical predictor variables with low information content are preferred if they have a high number of categories. The mechanisms involved in variable selection in classification trees based on imprecise probabilities are outlined theoretically as well as by means of simulation studies. Corrected estimators are proposed, which prove capable of reducing estimation bias as a source of variable selection bias.
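The finite-sample effect described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method: it assumes the ordinary plug-in Shannon entropy and information gain rather than the imprecise-probability criterion, and the function names, sample size, and category counts are illustrative choices. The predictor is generated independently of the response, so any positive average gain comes from estimation alone.

```python
import math
import random
from collections import Counter


def plugin_entropy(labels):
    """Plug-in (relative frequency) estimate of Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(x, y):
    """Empirical entropy reduction in y after splitting on the categories of x."""
    n = len(y)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    conditional = sum(len(g) / n * plugin_entropy(g) for g in groups.values())
    return plugin_entropy(y) - conditional


def mean_null_gain(n_categories, n_samples=100, n_reps=500, seed=0):
    """Average gain of a predictor drawn independently of a binary response.

    The true information gain is zero here by construction (assumed setup).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        y = [rng.randint(0, 1) for _ in range(n_samples)]
        x = [rng.randrange(n_categories) for _ in range(n_samples)]
        total += information_gain(x, y)
    return total / n_reps


if __name__ == "__main__":
    # Compare uninformative predictors that differ only in their number of categories.
    for k in (2, 4, 8, 16):
        print(k, round(mean_null_gain(k), 4))
```

Under these assumptions the average gain grows with the number of categories even though none of the predictors carries information, which is the kind of preference a selection criterion based on an uncorrected entropy estimate would show.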

Similar resources

Variable Selection in Classification Trees Based on Imprecise Probabilities

Classification trees are a popular statistical tool with multiple applications. Recent advancements of traditional classification trees, such as the approach of classification trees based on imprecise probabilities by Abellán and Moral (2004), effectively address their tendency to overfitting. However, another flaw inherent in traditional classification trees is not eliminated by the imprecise ...


Statistical Sources of Variable Selection Bias in Classification Tree Algorithms Based on the Gini Index

Evidence for variable selection bias in classification tree algorithms based on the Gini Index is reviewed from the literature and embedded into a broader explanatory scheme: Variable selection bias in classification tree algorithms based on the Gini Index can be caused not only by the statistical effect of multiple comparisons, but also by an increasing estimation bias and variance of the spli...


A bias correction algorithm for the Gini variable importance measure in classification trees

This paper considers a measure of variable importance frequently used in variable selection methods based on decision trees and tree-based ensemble models, like CART, Random Forests and Gradient Boosting Machine. It is defined as the total heterogeneity reduction produced by a given covariate on the response variable when the sample space is recursively partitioned. Some authors showed that thi...


Master Thesis by Paul Fink: Ensemble methods for classification trees under imprecise probabilities

In this master thesis some properties of bags of imprecise classification trees, as introduced in Abellán and Masegosa (2010), are analysed. In the beginning the statistical background of imprecise classification trees is outlined – starting with an overview on measuring uncertainty within the concept of Dempster–Shafer theory is presented, followed by a discussion of its application in a tree–...


Improving the Naive Bayes Classifier via a Quick Variable Selection Method Using Maximum of Entropy

Variable selection methods play an important role in the field of attribute mining. The Naive Bayes (NB) classifier is a very simple and popular classification method that yields good results in a short processing time. Hence, it is a very appropriate classifier for very large datasets. The method has a high dependence on the relationships between the variables. The Info-Gain (IG) measure, whic...



Publication date: 2007